In this paper, we tackle the problem of RGB-D semantic segmentation of indoor images. We take advantage of deconvolutional networks, which can predict pixel-wise class labels, and develop a new structure for deconvolution of multiple modalities. We propose a novel feature transformation network to bridge the convolutional and deconvolutional networks. In the feature transformation network, we correlate the two modalities by discovering common features between them, and characterize each modality by discovering modality-specific features. With the common features, we not only closely correlate the two modalities but also allow them to borrow features from each other to enhance the representation of shared information. With the specific features, we capture visual patterns that are visible in only one modality. The proposed network achieves competitive segmentation accuracy on the NYU Depth datasets V1 and V2.
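The common/specific split described above can be illustrated with a minimal sketch. This is not the paper's implementation: the projection matrices, dimensions, and the concatenation scheme for "borrowing" the other modality's common features are all illustrative assumptions (in practice these would be learned layers inside the network).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, S = 64, 16, 16  # input feature dim, common dim, specific dim (illustrative)

# Hypothetical projections; stand-ins for learned transformation layers.
W_common = rng.standard_normal((D, C)) * 0.1      # shared across modalities
W_rgb_spec = rng.standard_normal((D, S)) * 0.1    # RGB-only patterns
W_depth_spec = rng.standard_normal((D, S)) * 0.1  # depth-only patterns

def transform(rgb_feat, depth_feat):
    """Split each modality into common and modality-specific parts,
    then let each modality borrow the other's common features."""
    common_rgb = rgb_feat @ W_common
    common_depth = depth_feat @ W_common
    spec_rgb = rgb_feat @ W_rgb_spec
    spec_depth = depth_feat @ W_depth_spec
    # Enhanced representation: own specific + own common + borrowed common.
    rgb_out = np.concatenate([spec_rgb, common_rgb, common_depth], axis=-1)
    depth_out = np.concatenate([spec_depth, common_depth, common_rgb], axis=-1)
    return rgb_out, depth_out

rgb = rng.standard_normal((4, D))    # 4 per-pixel RGB feature vectors
depth = rng.standard_normal((4, D))  # 4 per-pixel depth feature vectors
r_out, d_out = transform(rgb, depth)
print(r_out.shape, d_out.shape)  # (4, 48) (4, 48)
```

The outputs of this stage would then feed the deconvolutional decoder that predicts per-pixel labels.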